feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload#845

Open
KayMKM wants to merge 13 commits into main from yuesu/build-only-ep-matrix

@KayMKM KayMKM commented Jun 9, 2026

Contributor

Summary

Adds a --build-only mode to scripts/e2e_eval/run_eval.py that generates ModelKit pipeline artifacts (export → optimize → quantize, no compile) across the full EP matrix, without requiring EP hardware, and optionally streams them to the Modelkit Azure Artifacts feed while bounding local disk usage.

Motivation: we need a way to mass-produce per-EP model artifacts for ~200 models on a single (CPU) box and distribute them, but (a) the build normally needs the target EP installed, and (b) writing every stage for every (model × EP) fills the disk fast.

What's included

1. --build-only mode + per-EP matrix

  • Runs winml config + winml build --no-compile per model; perf/accuracy are skipped.
  • When --ep/--device are omitted, builds the eval EP matrix into <model_dir>/<ep>_<device>/ subdirs: qnn_npu, qnn_gpu, ov_cpu, ov_npu, ov_gpu, mlas_cpu, dml_gpu, vitisai_npu. Pinning --ep/--device writes a single build directly into <model_dir>.
  • Precision per combo reuses the existing eval policy (NPU → w8a16, CPU/GPU → auto, native-quant EPs → --no-quant).
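
The matrix and precision rules above can be restated as a small lookup. This is a minimal sketch, not the code in run_eval.py: `BUILD_ONLY_MATRIX`, `combo_subdir`, and `precision_for` are hypothetical names, the PR does not say which EPs count as native-quant (so that set is a caller-supplied assumption), and checking native-quant before the NPU rule is also an assumption.

```python
# Combo list copied from the subdirectory names in the PR description.
BUILD_ONLY_MATRIX = [
    ("qnn", "npu"), ("qnn", "gpu"),
    ("ov", "cpu"), ("ov", "npu"), ("ov", "gpu"),
    ("mlas", "cpu"), ("dml", "gpu"), ("vitisai", "npu"),
]


def combo_subdir(ep: str, device: str) -> str:
    """Name of the per-combo output directory under <model_dir>."""
    return f"{ep}_{device}"


def precision_for(ep: str, device: str, native_quant_eps=frozenset()) -> str:
    """Hypothetical restatement of the precision policy.

    native_quant_eps is an assumption: the PR does not list which EPs
    quantize natively, so the caller must supply them.
    """
    if ep in native_quant_eps:
        return "no-quant"  # corresponds to passing --no-quant
    if device == "npu":
        return "w8a16"
    return "auto"  # CPU and GPU combos
```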

2. Cross-EP / cross-host builds (core fix in build.py)

  • _run_optimize_stage called resolve_device(device, ep=ep) purely to pick a progress-bar key, which raised when the target EP wasn't installed locally — blocking offline generation of (e.g.) QNN/VitisAI artifacts on a CPU box.
  • Now: when the build won't compile (config.compile is None), the missing-EP lookup soft-fails and falls back to the requested device (optimize/quantize only need the EP's static rule data, not a registered EP). When compile will run, it still fails fast. No behavior change for normal compiling builds.

3. Export dedup (disk saver)

  • The export.onnx stage is EP/device-independent, so all 8 combos produce an identical export. After each combo builds, its export is hash-compared against a per-model canonical: the first is moved to <model_dir>/_shared/, later identical ones are deleted — one export copy instead of 8 (export is the largest, full-precision artifact).
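
A minimal sketch of the hash-compare step, assuming the export is a single file (the real helper may hash several files, for example external weight data). `file_sha256` and `dedup_export` are hypothetical names, and keeping a non-identical export in place is an assumption about the mismatch case.

```python
import hashlib
import shutil
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large exports are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_export(export_path, shared_dir):
    """Keep one canonical export per model under shared_dir.

    Returns "moved" for the first export, "deleted" for an identical
    later one, and "kept" when the content differs.
    """
    export_path = Path(export_path)
    shared_dir = Path(shared_dir)
    canonical = shared_dir / export_path.name
    if not canonical.exists():
        shared_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(export_path), str(canonical))
        return "moved"
    if file_sha256(export_path) == file_sha256(canonical):
        export_path.unlink()
        return "deleted"
    return "kept"  # never delete an export that was not verified identical
```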

4. --upload: stream to Azure Artifacts feed, then delete locally

  • After a model's combos are built, publishes the whole model dir to the Modelkit feed as a Universal Package, then deletes the local copy — peak disk stays at ~one model's matrix.
  • Auth via az login (Entra ID), no PAT. The azure-devops extension is ensured and login verified up front; if not ready, the run aborts (so disk isn't silently filled).
  • Package: single name winml-cli-models, one version per model: 0.0.0-<run-stamp>-<model-slug> (valid SemVer 2.0; the 0.0.0- core keeps it legal, the date stamp + slug are the pre-release segment). The shared run-stamp groups a batch.
  • Resume: --continue + the same --run-stamp skips already-uploaded models without rebuilding them. Already-uploaded models are detected two ways: a local build_only_uploads.json manifest, and a query against the feed itself at startup (versions matching 0.0.0-<run-stamp>-*). Because the manifest is only written after a successful upload, a fresh --output-dir would otherwise start empty and rebuild everything; seeding from the feed makes it authoritative for what's published, so resume works regardless of local state. A feed-query failure falls back to local-manifest-only behavior.
  • Extra flags: --run-stamp, --keep-local, --upload-skip-existing, --feed/--feed-org/--feed-project/--package-name.
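
The version scheme can be checked against the SemVer 2.0 grammar with a short sketch. `slugify` and `model_version` are hypothetical helpers; the slug rule (lowercase, every run of non-alphanumeric characters becomes one hyphen, task name appended) is an assumption inferred from the example versions in this PR. The regular expression is the one suggested on semver.org.

```python
import re

# SemVer 2.0.0 pattern as suggested on semver.org.
SEMVER = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)


def slugify(text: str) -> str:
    """Assumed slug rule: lowercase, non-alphanumeric runs become '-'."""
    return re.sub(r"[^0-9A-Za-z]+", "-", text).strip("-").lower()


def model_version(run_stamp: str, model_id: str, task: str) -> str:
    """Build 0.0.0-<run-stamp>-<model-slug> for one model."""
    return f"0.0.0-{run_stamp}-{slugify(f'{model_id}-{task}')}"


version = model_version("20260609", "microsoft/resnet-50", "image-classification")
is_valid = SEMVER.match(version) is not None
```

Because the stamp and slug form a single pre-release identifier that contains letters, the rule against leading zeros in purely numeric identifiers does not apply to it; a bare numeric identifier such as `0.0.0-0609` would be rejected.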

Usage

# Build the full EP matrix for P0 models, stream to the feed, delete locals
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --priority P0

# Resume an interrupted batch (same run-stamp; a fresh output dir is fine)
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --continue \
  --run-stamp 20260609 --priority P0

# Local only (no feed); add --ep/--device to pin a single EP/device
uv run python scripts/e2e_eval/run_eval.py --build-only --hf-model microsoft/resnet-50

Download a specific model's specific file later:

az artifacts universal download \
  --organization https://dev.azure.com/microsoft --project windows.ai.toolkit \
  --scope project --feed Modelkit --name winml-cli-models \
  --version 0.0.0-20260609-microsoft-resnet-50-image-classification \
  --path ./out --file-filter 'qnn_npu/model.onnx*'

Notes / scope

  • src/winml/modelkit/commands/build.py change is gated on config.compile is None, so it only affects no-compile builds; compiling builds are unchanged and still fail fast on a missing EP.
  • --continue resumes from the feed (versions matching 0.0.0-<run-stamp>-*) and the local build_only_uploads.json manifest, so a resumed run only needs the same --run-stamp — it no longer has to reuse the same --output-dir. The feed query uses two &-free REST GETs (list packages → resolve the UPack package GUID → list versions) because az resolves to az.cmd and cmd.exe splits query strings on &, dropping every parameter after the first.
  • Verified manually end-to-end on a CPU host: full 8-EP matrix build, export dedup, publish to Modelkit, and --file-filter download of individual model.onnx files.
  • The feed: https://dev.azure.com/microsoft/windows.ai.toolkit/_artifacts/feed/Modelkit/UPack/winml-cli-models/overview/0.0.0-20260609-prajjwal1-bert-tiny-text-classification

KayMKM added 3 commits June 5, 2026 16:44
Add a --build-only mode to run_eval.py that runs config + build with
--no-compile, writing each pipeline stage's ONNX (export/optimize/quantize)
without requiring execution-provider hardware. Perf and accuracy are skipped.

When --ep/--device are omitted, every model is built once per EP in the
build-only matrix (qnn npu/gpu, openvino cpu/npu/gpu, mlas, dml, vitisai)
into <model_dir>/<ep>_<device>/ subdirs. When either is pinned, a single
build is written directly into the model dir. Precision per combo reuses
the existing _resolve_precision policy (NPU w8a16, CPU/GPU auto, native-quant
EPs unquantized).

Reuses the existing _run_build via a build_only flag (-o <dir> --no-compile
instead of --use-cache).

Two bugs surfaced when running `run_eval.py --build-only` against the EP matrix on a CPU-only host:

1. Every combo for the 'no native EP' subset (mlas/dml/openvino) was reported as `[FAIL @ complete]` even though export/optimize/quantize/model.onnx all landed correctly. `_run_build` was funnelling build-only results through `_extract_onnx_path`, which scans stdout for a `Final artifact:` marker that `winml build --no-compile` never prints, and falls back to the global cache which build-only doesn't populate (`-o <dir>` writes elsewhere). In build-only mode there is no downstream consumer of the path, so trust exit-code 0 directly and record `build_out` to keep the per-component bookkeeping balanced.

2. QNN/VitisAI combos failed at the optimize stage with `Requested EP 'qnn' is not available on this system`. `_run_optimize_stage` calls `resolve_device(device, ep=ep)` purely to pick the right `has_rule_data_for_ep` key for the progress bar, but that helper raises when the EP isn't installed locally -- even when the rest of the pipeline (export + optimize + quantize) runs on CPU and the EP is only needed at compile time. Soft-fail the lookup *only when* `config.compile is None` (i.e. `--no-compile` or a config that explicitly opts out); otherwise re-raise so configs that will compile still fail fast here instead of deep inside the compile stage.

Also moves `--clean-cache` from per-combo to per-model in `_run_build_only`: combos for the same model share the same HF download, so clearing between combos forced N redundant re-downloads of the same weights.

…facts feed

Running --build-only over the 8-EP matrix for many models fills local disk.
Two additions keep disk bounded:

1. Export dedup: the export.onnx stage is EP/device-independent, so every
   combo produces an identical export. After each combo builds, its export is
   hash-compared against a per-model canonical: the first is moved to
   <model_dir>/_shared/, later identical ones are deleted. One export copy on
   disk instead of 8.

2. --upload: after a model's combos are built, publish the model dir to the
   Modelkit Azure Artifacts feed as a Universal Package version, then delete it
   locally. Auth via az login (no PAT); the azure-devops extension is ensured
   and login verified up front (aborts otherwise so disk isn't silently filled).
   Version is 0.0.0-<run-stamp>-<model-slug> (valid SemVer 2.0; date stamp
   groups a batch). --continue + --run-stamp resume an interrupted batch from
   the build_only_uploads.json manifest without rebuilding uploaded models;
   --keep-local, --upload-skip-existing, and feed/package args round it out.
@KayMKM KayMKM requested a review from a team as a code owner June 9, 2026 07:57
@KayMKM KayMKM changed the title from "Yuesu/build only ep matrix" to "feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload" Jun 9, 2026

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Collaborator


Reviewed the full diff. Overall structure is solid and docstrings are thorough. Found a potential data-loss path (false-positive conflict detection triggers local directory deletion), a silent no-op for --continue in the non-upload build-only path, and a few other correctness/usability issues — see inline comments.

Comment thread scripts/e2e_eval/run_eval.py Outdated
Comment thread scripts/e2e_eval/run_eval.py
Comment thread scripts/e2e_eval/run_eval.py
Comment thread scripts/e2e_eval/run_eval.py Outdated
Comment thread scripts/e2e_eval/run_eval.py
Comment thread src/winml/modelkit/commands/build.py Outdated
KayMKM added 6 commits June 10, 2026 12:03
… versions

The --continue skip logic only consulted the local build_only_uploads.json manifest, which is written after each successful upload. A fresh --output-dir (e.g. a gitignored temp dir) starts empty, so models already published to the Azure Artifacts feed under the same run-stamp were rebuilt and re-uploaded instead of being skipped.

Seed the in-memory manifest from the feed at startup: query the feed REST API for versions matching 0.0.0-<run-stamp>-* and mark them as uploaded so the existing skip check honors them. The feed is now authoritative for what's published, regardless of local state. Querying is best-effort -- a failure falls back to local-manifest-only behavior.

Use two ampersand-free REST GETs (list packages -> resolve UPack package GUID -> list versions) because az resolves to az.cmd and cmd.exe splits query strings on '&', dropping every parameter after the first.
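
Seeding the skip set from the feed might look roughly like this. It is a sketch under assumptions: `slugs_from_feed` and `seed_uploaded` are hypothetical names, `fetch_versions` stands in for the real feed query, and keying the manifest by model slug is a guess about its layout.

```python
def slugs_from_feed(versions, run_stamp):
    """Extract model slugs from versions matching 0.0.0-<run-stamp>-*."""
    prefix = f"0.0.0-{run_stamp}-"
    return {v[len(prefix):] for v in versions if v.startswith(prefix)}


def seed_uploaded(local_manifest, fetch_versions, run_stamp):
    """Union the local manifest with what the feed already holds.

    fetch_versions is a stand-in for the feed query. Any failure is
    swallowed on purpose: the query is best-effort, so the run falls
    back to the local manifest alone.
    """
    uploaded = set(local_manifest)
    try:
        uploaded |= slugs_from_feed(fetch_versions(), run_stamp)
    except Exception:
        pass  # feed unreachable: local-manifest-only behavior
    return uploaded
```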
- _hash_files: stop hashing unreadable files to a fixed sentinel; propagate
  OSError and have _dedup_export keep the export in place instead of risking
  deletion of an artifact never verified identical.
- _is_publish_conflict: narrow detection to specific version-exists / HTTP 409
  markers (drop bare 'conflict'/'409') so an unrelated message can't trigger
  exists-skipped and rmtree the local model dir.
- build.py _run_optimize_stage: narrow the no-compile EP fallback to only
  swallow EP-not-available ValueErrors; re-raise malformed device/EP names.
- Warn when --continue is used with --build-only but without --upload (no
  local-disk resume exists, so everything is rebuilt).
- Document that the pinned-EP auto-device path delegates precision to winml
  config's auto-detection.
- Fix misleading --upload-skip-existing help: it does not skip the build.
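
The narrowed conflict detection can be illustrated with a short sketch. The marker strings below are hypothetical placeholders, since this PR does not quote the exact text the publish command emits; only the principle (match specific phrases, never a bare "conflict" or "409") comes from the commit message.

```python
# Hypothetical markers: the real strings are not quoted in this PR.
CONFLICT_MARKERS = ("already exists", "http 409", "status code 409")


def is_publish_conflict(output: str) -> bool:
    """True only for messages that clearly say the version already exists."""
    text = output.lower()
    return any(marker in text for marker in CONFLICT_MARKERS)
```

An unrelated line such as "uploaded 409 files" no longer matches, so it cannot be classified as exists-skipped and cause the local model dir to be removed.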
When a mid-run upload failed because Azure CLI was unavailable (not logged in, token expired, or az hung and was killed), the model was marked 'failed' and kept locally while the batch continued. Every subsequent model hit the same az failure and its local copy piled up, filling the disk (a single 7B LLM matrix is ~450 GB).

Add _is_az_unavailable() to distinguish a host-level az/login problem from a per-package publish error (network blip or version conflict), and abort the run immediately (exit 3) when an upload fails for that reason. The user re-runs 'az login' and resumes with --continue + the same --run-stamp; already-uploaded models are skipped.
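
The abort-on-host-failure behavior can be sketched as a loop that stops at the first upload the predicate attributes to az itself. `upload_all` is a hypothetical name, `upload` and `is_az_unavailable` are injected stand-ins, and returning 0 when only per-package failures occurred is an assumption; only exit code 3 comes from the commit message.

```python
def upload_all(models, upload, is_az_unavailable):
    """Upload models in order; stop early on a host-level az failure.

    upload(model) is assumed to return (ok, output). Returns the
    per-model results and the process exit code.
    """
    results = {}
    for model in models:
        ok, output = upload(model)
        if ok:
            results[model] = "uploaded"
            continue
        results[model] = "failed"
        if is_az_unavailable(output):
            # Every later upload would fail the same way and pile up on disk.
            return results, 3
    return results, 0  # assumption: per-package failures do not change the code
```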
Uploading all 8 EP/device combos as one Universal Package version timed out on
large models and left the local artifacts behind, filling the disk. Each combo
is now built, uploaded, and deleted on its own.

- Version is per combo: 0.0.0-<run-stamp>-<ep>-<device>-<model-slug>, so each
  package is small (lower timeout risk) and can be retried/resumed on its own.
- Local artifacts are removed after every outcome (uploaded, version-exists,
  timeout, upload-failed, build-failed) unless --keep-local, so disk stays
  bounded. A timeout/failure is recorded and the run continues; only a
  host-level az failure (not logged in / token expired) aborts.
- Every (model, combo) outcome is written to build_only_results.json (replaces
  build_only_uploads.json) for auditing, with or without --upload; it also
  drives per-combo --continue resume.
- Export dedup now applies only without --upload (each uploaded combo is
  self-contained).

Adds _classify_upload helper and unit tests for the version format, results
I/O, az-unavailable classification, and the per-combo cleanup/timeout/abort
orchestration.
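
A rough sketch of the per-combo bookkeeping follows. `combo_version`, `record_outcome`, and `pending_combos` are hypothetical names, the nested `{model: {combo: outcome}}` JSON layout is a guess at what build_only_results.json holds, and treating only "uploaded" and "version-exists" as finished on resume is an assumption.

```python
import json
from pathlib import Path

# Assumption: these two outcomes mean the combo need not be redone.
DONE_OUTCOMES = {"uploaded", "version-exists"}


def combo_version(run_stamp, ep, device, model_slug):
    """Build 0.0.0-<run-stamp>-<ep>-<device>-<model-slug>."""
    return f"0.0.0-{run_stamp}-{ep}-{device}-{model_slug}"


def record_outcome(results_path, model, combo, outcome):
    """Persist one (model, combo) outcome and return the full mapping."""
    path = Path(results_path)
    results = json.loads(path.read_text()) if path.exists() else {}
    results.setdefault(model, {})[combo] = outcome
    path.write_text(json.dumps(results, indent=2))
    return results


def pending_combos(results, model, combos):
    """Combos that a resumed run would still have to build and upload."""
    done = results.get(model, {})
    return [c for c in combos if done.get(c) not in DONE_OUTCOMES]
```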
…atrix

# Conflicts:
#	src/winml/modelkit/commands/build.py

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Collaborator


v2 re-review: the previous five issues are all addressed -- OSError propagation in _hash_files, narrow conflict markers, --continue warning, per-combo upload design, and the build.py ep_unavailable guard. Remaining findings: one correctness bug (pagination truncation in _fetch_feed_versions silently breaks --continue resume on large feeds) and two lower-severity concerns.

Comment thread scripts/e2e_eval/run_eval.py
Comment thread scripts/e2e_eval/run_eval.py Outdated
Comment thread scripts/e2e_eval/run_eval.py
Comment thread src/winml/modelkit/commands/build.py Outdated
Comment thread scripts/e2e_eval/run_eval.py Fixed
KayMKM added 3 commits June 26, 2026 15:25
- _fetch_feed_versions: return None (not an empty set) when the package is
  absent from the /packages listing. That listing is paginated (~25/page
  default) and $top can't be appended here -- a second query param needs `&`,
  which az.cmd/cmd.exe splits -- so an empty set was indistinguishable from a
  pagination miss and silently rebuilt the whole batch on --continue. None
  takes the explicit "could not query feed" fallback instead.
- _is_az_unavailable: narrow the broad 'refresh token' marker to 'invalid_grant'
  (the OAuth2 code Azure AD emits for an expired/revoked refresh token), so an
  informational MSAL "Refreshing token..." line on a transient exit-1 can no
  longer abort the entire run.
- Document that --upload-skip-existing is the safe flag on a --continue resume:
  a timed-out upload may have committed server-side, so the retry's 409 should
  count as done, not failed (README example + flags table, the flag's help
  text, and the _classify_upload docstring).

Adds unit tests for the narrowed marker and for _fetch_feed_versions
not-found -> None / matching-versions / query-failure.
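
The None-versus-empty-set distinction can be sketched as below. `versions_if_listed` and `resolve_skip_set` are hypothetical names, and the `{"name": ..., "id": ...}` package shape is an assumption about the listing response.

```python
def versions_if_listed(packages, package_name, list_versions):
    """Return the package's versions, or None if it is not in the listing.

    None means "could not tell" (the listing is paginated, so absence
    is not proof); an empty set means "package found, nothing published".
    """
    for pkg in packages:
        if pkg["name"] == package_name:
            return set(list_versions(pkg["id"]))
    return None


def resolve_skip_set(feed_versions, local_done):
    """Combine feed and local state, noting which sources were used."""
    if feed_versions is None:
        return set(local_done), "local-only"
    return set(local_done) | set(feed_versions), "feed+local"
```

Returning None keeps a pagination miss from being read as "nothing has been published", which is what silently rebuilt the whole batch before this fix.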
CodeQL flagged the best-effort 'remove now-empty model dir' cleanup in _run_build_only as an empty 'except OSError: pass'. Add an explanatory comment (no behaviour change) so the intent -- ignoring a non-empty/locked dir -- is documented.
3 participants